The problem at hand is credit card fraud prediction. The goal is to develop a model that can accurately identify fraudulent transactions from legitimate ones based on the given dataset. The expected benefits of this project include:
Reduced financial losses for credit card companies and cardholders
Improved customer trust and satisfaction
Enhanced security measures for credit card transactions
More efficient allocation of resources for fraud investigation
Fraudulent transactions often exhibit unusual patterns, such as sudden large amounts, frequent purchases in quick succession, or transactions originating from unfamiliar locations, yet detecting them reliably remains a complex task. The issue is further complicated by data imbalance: fraudulent transactions are much rarer than legitimate ones, and this imbalance needs careful management to avoid bias in model training. Features such as transaction amount, location, time, merchant details, and historical customer behavior are crucial for improving model accuracy. Additionally, any solution must comply with data privacy regulations such as GDPR, which dictate how personal data is managed. The ever-evolving strategies of fraudsters require fraud detection systems to be regularly updated and adapted. Finally, a good balance is essential to minimize false positives, which annoy customers, and false negatives, which cause financial losses, making precision and recall equally important.
Data Preparation & Feature Engineering
Raw Data
The Credit Card Fraud Prediction dataset offers a variety of attributes valuable for comprehensive analysis. It contains 555,719 instances and 22 attributes, a mix of categorical and numerical data types. Importantly, the dataset is complete with no null values. Below is a breakdown of the attributes.
trans_date_trans_time: Transaction date and time
cc_num: Unique customer identification number
merchant: The merchant involved in the transaction
category: Transaction type (e.g., personal, childcare)
amt: Transaction amount
first: Cardholder’s first name
last: Cardholder’s last name
gender: Cardholder’s gender
street: Cardholder’s street address
city: Cardholder’s city of residence
state: Cardholder’s state of residence
zip: Cardholder’s zip code
lat: Latitude of cardholder’s location
long: Longitude of cardholder’s location
city_pop: Population of the cardholder’s city
job: Cardholder’s job title
dob: Cardholder’s date of birth
trans_num: Unique transaction identifier
unix_time: Transaction timestamp (Unix format)
merch_lat: Merchant’s location (latitude)
merch_long: Merchant’s location (longitude)
is_fraud: Fraudulent transaction indicator (1 = fraud, 0 = legitimate). This is the target variable for classification purposes.
For easier downstream processing and analysis, we convert is_fraud and gender to boolean variables.
Code
is_fraud = pl.col("is_fraud").cast(pl.Boolean)
# Encode Male as False (0), Female as True (1)
gender = (pl.col("gender") == "F").cast(pl.Boolean)
Transaction Date and Time
The trans_date_trans_time feature can be transformed to extract meaningful temporal information. Fraudulent transactions may occur at unusual times or exhibit different patterns compared to legitimate ones. Therefore, we extract the following features:
tx_datetime: Typed datetime object for the transaction timestamp, which is represented as a string in the raw data.
tx_hour: Time of day when the transaction occurred.
tx_day_of_week: To capture weekly patterns.
tx_is_weekend: Boolean indicating if the transaction occurred on a weekend.
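The pipeline derives these features with polars datetime expressions; the plain-Python sketch below illustrates the same logic on a single timestamp. The input format string ("%d/%m/%Y %H:%M") is an assumption about the raw data.

```python
from datetime import datetime

def temporal_features(ts: str) -> dict:
    """Derive tx_hour, tx_day_of_week, and tx_is_weekend from a raw timestamp.

    The input format ("%d/%m/%Y %H:%M") is an assumption about the raw strings;
    the real pipeline does this with polars datetime expressions.
    """
    tx_datetime = datetime.strptime(ts, "%d/%m/%Y %H:%M")
    day_of_week = tx_datetime.weekday()  # Monday = 0, ..., Sunday = 6
    return {
        "tx_hour": tx_datetime.hour,
        "tx_day_of_week": day_of_week,
        "tx_is_weekend": day_of_week >= 5,  # Saturday or Sunday
    }

# 2020-06-21, the first day covered by the dataset, was a Sunday
features = temporal_features("21/06/2020 12:14")
```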
Credit Card Number
As a unique identifier of the credit card holder, cc_num does not by itself help tell fraudulent transactions apart from legitimate ones. However, it can be used to engineer features that capture cardholder-specific behavior. We create the following features:
is_frequently_visited_merchant: Boolean indicating if the cardholder frequently visits the same merchant, where “frequent” is defined as more than 2 transactions with the same merchant.
amt_median: Median transaction amount for the cardholder at the given merchant. Allows us to compute the deviation of each transaction at a merchant from the cardholder’s typical spending behavior at that merchant.
Code
cc_num_merchant_agg_df = df.group_by("cc_num", "merchant").agg(
    is_frequently_visited_merchant=pl.len() > 2,
    amt_median=pl.col("amt").median(),
    # The given customer fell victim to fraud this many times
    num_frauds_suffered=pl.sum("is_fraud"),
)
cc_num_merchant_agg_df.head()
shape: (5, 5)
cc_num      merchant                           is_frequently_visited_merchant  amt_median  num_frauds_suffered
f64         str                                bool                            f64         i64
4.4495e15   "fraud_Hyatt, Russel and Gleich…   false                           67.705      0
5.6540e11   "fraud_Ruecker-Mayert"             false                           49.1        0
3.5893e15   "fraud_Bins-Rice"                  true                            50.22       0
4.2479e12   "fraud_Wiza LLC"                   false                           3.67        0
4.4811e12   "fraud_Ernser-Lynch"               false                           15.79       0
We join these features back to the original dataset on the cc_num and merchant fields.
merchant is a high-cardinality categorical variable (\(693\) unique occurrences) carrying the name of the merchant with which a transaction occurred. We could use frequency encoding to transform this feature into a numerical one. However, as we already have access to the category feature, which provides a more general description of the transaction type, we drop merchant: the category of a transaction is, in essence, a clustered representation of the merchant.
We use merch_lat and merch_long to calculate the distance between the merchant and the cardholder. This can be a useful feature, as fraudulent transactions may occur when the merchant is located far from the cardholder. We compute the following features:
distance_from_merch: The spheroid distance between the merchant and the cardholder
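The text speaks of a spheroid distance; the haversine formula below is a close spherical approximation that captures the same idea and is a common choice for this kind of feature. A minimal sketch:

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points,
    using a spherical Earth approximation."""
    r = 6371.0  # Mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * r * math.asin(math.sqrt(a))

# Roughly 3,900 km between central Los Angeles and New York City
distance = haversine_km(34.05, -118.24, 40.71, -74.01)
```

In the pipeline the same computation would be applied row-wise to (lat, long) and (merch_lat, merch_long).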
The amt feature can be used directly for detecting anomalies in transaction amounts. We apply a logarithmic transformation to normalize the distribution of transaction amounts, accounting for the skewed nature of the data (many small, \(\leq \$100\), transactions and a few large, \(\geq \$1000\), ones).
city_pop may indicate the likelihood of fraud occurring in certain population densities. We create the following feature:
city_pop_cat: Categorical representation of the city population determined based on a threshold to bin populations into categories (e.g., rural, suburban, urban)
Categorizing population sizes into discrete groups like village, town, and city makes the model’s results more interpretable and easier to communicate to stakeholders. This classification aligns with how people typically think about settlement sizes, making it simpler to communicate findings and insights.
Also, if the relationship between population size and the target variable is non-linear, discretization can help capture these complex patterns. For example, there might be distinct differences in certain characteristics between villages, towns, and cities that are not proportional to their population sizes.
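Using the population thresholds listed in the encoding section later (hamlet at 0, village at 500, town at 2,500, city at 25,000, metropolis at 1,000,000, megalopolis at 5,000,000), the binning can be sketched with a sorted threshold lookup. Whether the boundaries are inclusive is our assumption:

```python
import bisect

# Thresholds and labels for city_pop_cat; boundary inclusivity is our choice
THRESHOLDS = [500, 2_500, 25_000, 1_000_000, 5_000_000]
LABELS = ["hamlet", "village", "town", "city", "metropolis", "megalopolis"]

def city_pop_cat(population: int) -> str:
    """Map a raw population count to its ordinal settlement category."""
    return LABELS[bisect.bisect_right(THRESHOLDS, population)]

city_pop_cat(1_200)  # "village"
```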
Under the assumption that certain individuals are more likely to be victims of credit card fraud, the attributes we have about the cardholder are likely to have major predictive power. We ignore the first and last name columns, as they are personal identifiers and inherently not useful for generalization. We encode gender as a binary variable. We ignore the street attribute as it is too granular, but we encode city, state, and zip directly. We use lat and long to calculate the distance between the cardholder and the merchant. Using dob and tx_datetime, we also calculate the age of the cardholder at the time of the transaction.
Furthermore, as the job title of the cardholder may be indicative of their income level, and background in general, it might have a non-negligible predictive power. Therefore, we also want to include this feature in our model. We create the following features:
age: Age of the cardholder at the time of the transaction
distance_from_merch: The distance between the cardholder and the merchant
job_group: Employment group based on the cardholder’s job title
Code
dob = pl.col("dob").str.to_datetime("%d/%m/%Y")
age = (tx_datetime.dt.year() - dob.dt.year()).cast(pl.UInt16)
df.select("", tx_datetime=tx_datetime, dob=dob, age=age).head()
shape: (5, 4)
      tx_datetime          dob                  age
i64   datetime[μs]         datetime[μs]         u16
0     2020-06-21 12:14:00  1968-03-19 00:00:00  52
1     2020-06-21 12:14:00  1990-01-17 00:00:00  30
2     2020-06-21 12:14:00  1970-10-21 00:00:00  50
3     2020-06-21 12:15:00  1987-07-25 00:00:00  33
4     2020-06-21 12:15:00  1955-07-06 00:00:00  65
Job Group
Code
df.select("job").unique().sort("job")
shape: (478, 1)
job
str
"Academic librarian"
"Accountant, chartered certifie…
"Accountant, chartered public f…
"Accounting technician"
"Acupuncturist"
…
"Water engineer"
"Water quality scientist"
"Web designer"
"Wellsite geologist"
"Writer"
The job feature is a high-cardinality categorical variable (\(478\) unique occurrences) that we group into a smaller set of more general job categories.
To create the job_group feature, we employ a combination of techniques. First, we generate text embeddings for each job title using the nomic-embed-text embedding model served locally via the Ollama platform. These embeddings represent the semantic meaning of the job titles in a high-dimensional space, capturing the nuances and similarities between them. Using these embeddings, we apply the KMeans clustering algorithm to group similar job titles together. This clustering automatically identifies patterns and groups jobs based on their textual similarities, without requiring predefined labels.
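The pipeline runs KMeans (e.g., scikit-learn's implementation) on the 768-dimensional embedding vectors. The toy sketch below shows the algorithm's two alternating steps (assign points to the nearest centroid, then move each centroid to the mean of its members) on 2-D points; the blob coordinates are made up purely for illustration.

```python
import random

def kmeans(points: list[tuple[float, float]], k: int, iters: int = 50, seed: int = 0):
    """Minimal k-means: returns a cluster label for each point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels

# Two well-separated blobs stand in for job-title embeddings
blob_a = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)]
blob_b = [(5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
labels = kmeans(blob_a + blob_b, k=2)
```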
Once clusters are formed, a local large language model (Qwen2.5-7B), served through Ollama, is used to associate each cluster with a predefined job group from our JOB_GROUPS list. The model analyzes a sample of job titles from each cluster and determines the most appropriate job category, ensuring that the mapping is both accurate and aligned with industry standards.
For visualization, we reduce the dimensions of the high-dimensional embeddings using UMAP (Uniform Manifold Approximation and Projection). UMAP is preferred over PCA (Principal Component Analysis) because it preserves both the local and global structure of the data more effectively, providing a clearer and more meaningful visualization of the clustered job titles.
The table below shows that each unique job title is represented by a \(768\)-dimensional embedding vector generated by nomic-embed-text. The cluster column indicates the cluster to which each job title belongs, and the projection column is a two-dimensional representation of the embeddings obtained using UMAP.
Note that cluster is not easy to interpret at this point, as it is just an unsigned integer.
def infer_cluster_names(
    cluster_samples: list[dict], model: str = "ollama:qwen2.5:7b"
) -> list[str]:
    names: list[str] = []

    class JobGroup(pydantic.BaseModel):
        name: str

    agent = pydantic_ai.Agent(model, result_type=JobGroup)
    for sample in tqdm(
        cluster_samples, desc=f"Inferring Job Group Names using LLM {model}"
    ):
        groups_list = "\n".join([f"- {group}" for group in JOB_GROUPS])
        prompt = f"""Task: Categorize the following job titles into a single, specific job group name.

Job Titles: {sample}

Requirements:
1. Provide ONLY ONE job group name
2. Use standard industry categories
3. Be generic
4. Use title case format
5. Maximum 2-3 words
6. Don't include words like "Professional" or "Specialist"
7. Focus on the core function/domain

ONLY CHOOSE FROM THE FOLLOWING JOB GROUPS, PRESERVE EXACT SPELLING:
{groups_list}

Response Format:
Return ONLY the job group name without any additional text or explanation.

Job Group Name:"""
        result = agent.run_sync(prompt).data
        job_group_name = result.name.strip()
        names.append(job_group_name)
    return names
We use the Qwen2.5-7B large language model with a simple prompt to create human-readable, interpretable labels for the clusters identified by KMeans.
The visualization below illustrates the job titles in a two-dimensional space, where each point represents a job title. The points are colored based on the cluster they belong to, showing the separation of job titles into distinct groups. This separation is a result of the clustering algorithm grouping similar job titles together based on their embeddings and then subsequently assigning them to predefined job groups via the language model.
Verify the Grouping
To check the accuracy of the clustering, simply hover over the points on the visualization. You’ll see the job titles and their corresponding clusters, which helps confirm that jobs are correctly grouped.
For example, job titles like “Software Engineer,” “Network Engineer,” and “Manufacturing Systems Engineer” are clustered closely together under the “Engineering” group. Similarly, titles like “Teacher, Primary School,” “Learning Mentor,” and “Music Tutor” are automatically assigned to the “Education” group.
We apply label encoding to binary features as it is a simple and effective method to convert categorical data into numerical values. Binary features inherently have only two categories, and label encoding maps these to 0 and 1, preserving their natural dichotomy. This ensures that the input is directly compatible with a wide range of machine learning algorithms, which typically require numerical inputs. Additionally, label encoding maintains the interpretability of the feature, as the numerical representation (0 or 1) clearly indicates the absence or presence of a characteristic. This approach is computationally efficient and does not introduce any artificial ordinal relationships.
We use label encoding for the city_pop_cat feature because it consists of categories with a clear order: “hamlet” (\(0\)), “village” (\(500\)), “town” (\(2,500\)), “city” (\(25,000\)), “metropolis” (\(1,000,000\)), and “megalopolis” (\(5,000,000\)). This encoding converts these ordered categories into numerical values, reflecting their progression from the smallest to the largest in terms of population size and complexity. By doing so, we preserve the meaningful hierarchy of these population centers, which helps models recognize their differences and relationships.
Code
ordinal_features_df = features_df.select(
    [pl.col(col).cast(pl.Categorical).to_physical() for col in ["city_pop_cat"]]
)
ordinal_features_df.head()
shape: (5, 1)
city_pop_cat
u32
0
1
0
0
2
We use one-hot encoding for nominal features when there’s no inherent order among the categories. This method converts each category into a separate binary column. By dropping the first dummy column, we avoid redundancy and eliminate multicollinearity, since the dropped category is fully determined by the remaining columns. This approach allows the model to consider each category independently, without implying any hierarchy, and ensures compatibility with algorithms that expect numerical inputs.
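A minimal plain-Python sketch of drop-first one-hot encoding (the `state_` column naming is a hypothetical convention; in the pipeline this would be done with a dataframe library):

```python
def one_hot_drop_first(values: list[str]) -> tuple[list[str], list[list[int]]]:
    """One-hot encode a nominal column, dropping the first category so the
    remaining dummy columns are not perfectly collinear."""
    categories = sorted(set(values))
    kept = categories[1:]  # The first category becomes the all-zeros baseline
    columns = [f"state_{c}" for c in kept]  # Hypothetical column naming scheme
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return columns, rows

cols, rows = one_hot_drop_first(["NY", "CA", "TX", "CA"])
# "CA" (first alphabetically) is the dropped baseline, encoded as all zeros
```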
We analyze the temporal patterns of fraudulent transactions to identify any recurring trends or anomalies. This analysis can help us understand when fraud is most likely to occur and how it differs from legitimate transactions. We examine the distribution of fraudulent transactions across different time periods, such as hours of the day, days of the week, and months of the year.
# Filter fraudulent transactions
fraud_transactions = enriched_df_pandas[enriched_df_pandas["is_fraud"] == True]

# Count the number of fraudulent transactions by hour of the day
fraud_counts = fraud_transactions["hour_of_day"].value_counts().sort_index()

# Create a color palette based on the counts
palette = sns.color_palette("coolwarm", as_cmap=True)
# Convert the palette to a list
palette_list = palette(fraud_counts.values / max(fraud_counts.values)).tolist()

sns.set_theme(style="whitegrid", context="talk")

# Plot the data
plt.figure(figsize=(10, 6))
sns.barplot(
    x=fraud_counts.index,
    y=fraud_counts.values,
    hue=fraud_counts.index,
    palette=palette_list,
    dodge=False,
    legend=False,
)
plt.title("Fraudulent Transactions by Hour of the Day")
plt.xlabel("Hour of the Day")
plt.ylabel("Number of Fraudulent Transactions")
plt.show()
Fraudulent Transactions by Day of the Week
Code
# Count the number of fraudulent transactions by day of the week
fraud_counts_day = fraud_transactions["day_of_week"].value_counts().sort_index()

# Create a color palette based on the counts
palette = sns.color_palette("Blues", as_cmap=True)
# Convert the palette to a list
palette_list = palette(fraud_counts_day.values / max(fraud_counts_day.values)).tolist()

sns.set_theme(style="whitegrid", context="talk")

# Plot the data
plt.figure(figsize=(10, 6))
sns.barplot(
    x=fraud_counts_day.index,
    y=fraud_counts_day.values,
    hue=fraud_counts_day.index,
    palette=palette_list,
    dodge=False,
    legend=False,
)
plt.title("Fraudulent Transactions by Day of the Week")
plt.xlabel("Day of the week")
plt.ylabel("Number of Fraudulent Transactions")
plt.xticks(
    ticks=fraud_counts_day.index - 1,
    labels=[
        "Monday",
        "Tuesday",
        "Wednesday",
        "Thursday",
        "Friday",
        "Saturday",
        "Sunday",
    ],
)
plt.show()
The distribution of fraudulent transactions across the week is roughly even, with a slight increase observed on Sundays.
Correlation Between Transaction Frequency and Fraud Incidence Over Time
Code
# Convert the 'trans_date_trans_time' column to datetime
enriched_df_pandas["trans_date_trans_time"] = pd.to_datetime(
    enriched_df_pandas["trans_date_trans_time"]
)

# Set the 'trans_date_trans_time' as the index
enriched_df_pandas.set_index("trans_date_trans_time", inplace=True)

# Group by 'cc_num' and resample for different periods
transactions_per_day = enriched_df_pandas.groupby("cc_num").resample("D").size()
transactions_per_week = enriched_df_pandas.groupby("cc_num").resample("W").size()
transactions_per_two_weeks = enriched_df_pandas.groupby("cc_num").resample("2W").size()
transactions_per_month = enriched_df_pandas.groupby("cc_num").resample("ME").size()

# Reset index to make 'cc_num' a column again
transactions_per_day = transactions_per_day.reset_index(name="transactions_per_day")
transactions_per_week = transactions_per_week.reset_index(name="transactions_per_week")
transactions_per_two_weeks = transactions_per_two_weeks.reset_index(
    name="transactions_per_two_weeks"
)
transactions_per_month = transactions_per_month.reset_index(
    name="transactions_per_month"
)
Code
# Create a figure with subplots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

# Plot distributions
sns.histplot(data=transactions_per_day, x="transactions_per_day", ax=ax1)
ax1.set_title("Transactions per Day")
ax1.set_xlabel("Number of Transactions")

sns.histplot(data=transactions_per_week, x="transactions_per_week", ax=ax2)
ax2.set_title("Transactions per Week")
ax2.set_xlabel("Number of Transactions")

sns.histplot(data=transactions_per_two_weeks, x="transactions_per_two_weeks", ax=ax3)
ax3.set_title("Transactions per Two Weeks")
ax3.set_xlabel("Number of Transactions")

sns.histplot(data=transactions_per_month, x="transactions_per_month", ax=ax4)
ax4.set_title("Transactions per Month")
ax4.set_xlabel("Number of Transactions")

plt.tight_layout()
plt.show()
Code
# Merge enriched_df_pandas with transactions_per_day using a time-based join
merged_day = pd.merge_asof(
    enriched_df_pandas.sort_values("trans_date_trans_time"),
    transactions_per_day.sort_values("trans_date_trans_time"),
    on="trans_date_trans_time",
    by="cc_num",
    tolerance=pd.Timedelta("1D"),
    direction="nearest",
)
# Filter for fraudulent transactions
fraud_transactions_day = merged_day[merged_day["is_fraud"] == True]

# Merge enriched_df_pandas with transactions_per_week using a time-based join
merged_week = pd.merge_asof(
    enriched_df_pandas.sort_values("trans_date_trans_time"),
    transactions_per_week.sort_values("trans_date_trans_time"),
    on="trans_date_trans_time",
    by="cc_num",
    tolerance=pd.Timedelta("7D"),
    direction="nearest",
)
# Filter for fraudulent transactions
fraud_transactions_week = merged_week[merged_week["is_fraud"] == True]

# Merge enriched_df_pandas with transactions_per_two_weeks using a time-based join
merged_two_weeks = pd.merge_asof(
    enriched_df_pandas.sort_values("trans_date_trans_time"),
    transactions_per_two_weeks.sort_values("trans_date_trans_time"),
    on="trans_date_trans_time",
    by="cc_num",
    tolerance=pd.Timedelta("14D"),
    direction="nearest",
)
# Filter for fraudulent transactions
fraud_transactions_two_weeks = merged_two_weeks[merged_two_weeks["is_fraud"] == True]

# Merge enriched_df_pandas with transactions_per_month using a time-based join
merged_month = pd.merge_asof(
    enriched_df_pandas.sort_values("trans_date_trans_time"),
    transactions_per_month.sort_values("trans_date_trans_time"),
    on="trans_date_trans_time",
    by="cc_num",
    tolerance=pd.Timedelta("30D"),
    direction="nearest",
)
# Filter for fraudulent transactions
fraud_transactions_month = merged_month[merged_month["is_fraud"] == True]

# Merge the necessary DataFrames to include the required columns using a time-based join
merged_all = pd.merge_asof(
    merged_day.sort_values("trans_date_trans_time"),
    transactions_per_week.sort_values("trans_date_trans_time"),
    on="trans_date_trans_time",
    by="cc_num",
    tolerance=pd.Timedelta("7D"),
    direction="nearest",
)
merged_all = pd.merge_asof(
    merged_all.sort_values("trans_date_trans_time"),
    transactions_per_two_weeks.sort_values("trans_date_trans_time"),
    on="trans_date_trans_time",
    by="cc_num",
    tolerance=pd.Timedelta("14D"),
    direction="nearest",
)
merged_all = pd.merge_asof(
    merged_all.sort_values("trans_date_trans_time"),
    transactions_per_month.sort_values("trans_date_trans_time"),
    on="trans_date_trans_time",
    by="cc_num",
    tolerance=pd.Timedelta("30D"),
    direction="nearest",
)
Code
# Calculate the correlation matrix for the relevant columns
correlation_matrix = merged_all[
    [
        "transactions_per_day",
        "transactions_per_week",
        "transactions_per_two_weeks",
        "transactions_per_month",
        "is_fraud",
    ]
].corr()

# Rename the columns and index for better readability
labels = [
    "Transactions per Day",
    "Transactions per Week",
    "Transactions per Two Weeks",
    "Transactions per Month",
    "Is Fraud",
]
correlation_matrix.columns = labels
correlation_matrix.index = labels

# Plot the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1, fmt=".2f")
plt.title("Correlation between Transaction Frequency and Fraud")
plt.show()
No meaningful correlation is found between transaction frequency and fraud incidence.
Seasonal Patterns in Daily Fraudulent Transaction Frequency
Note: The monthly patterns are omitted as the time series only covers six months of a single year, limiting the ability to derive meaningful insights.
# Plot the number of fraudulent transactions against date using a line plot
plt.figure(figsize=(10, 6))
enriched_df_pandas[enriched_df_pandas["is_fraud"] == True].groupby("date").size().plot()
plt.title("Number of Fraudulent Transactions by Date")
plt.xlabel("Date")
plt.ylabel("Number of Fraudulent Transactions")
plt.show()
Code
# The data is from 2020-06-21 to 2020-12-31
print(enriched_df_pandas["date"].sort_values(ascending=False).head())
print(enriched_df_pandas["date"].sort_values(ascending=False).tail())
The ACF decays quickly after lag 1, and the PACF has a single significant spike at lag 1. This suggests that if we wanted to fit an ARIMA model, we could start with an AR(1) term and also experiment with an MA term.
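The ACF figures above come from statsmodels; as a sanity check, the sample autocorrelation can be computed directly. The sketch below uses a synthetic AR(1) series (the fraud counts themselves are not reproduced here) to show the geometric decay just described:

```python
import random

def acf(series: list[float], max_lag: int) -> list[float]:
    """Sample autocorrelation function for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    return [
        sum((series[t] - mean) * (series[t + k] - mean) for t in range(n - k)) / var
        for k in range(max_lag + 1)
    ]

# A synthetic AR(1) series with coefficient 0.8 (illustrative, not the fraud data)
rng = random.Random(42)
x = [0.0]
for _ in range(499):
    x.append(0.8 * x[-1] + rng.gauss(0, 1))

r = acf(x, 3)  # r[1] is large and the values decay geometrically, as expected for AR(1)
```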
Time Series Decomposition
Code
# Ensure date is datetime and set as index
fraud_count_over_time.index = pd.to_datetime(fraud_count_over_time.index)

# Decompose (use 'additive' or 'multiplicative' based on data behavior)
decomposition = seasonal_decompose(
    fraud_count_over_time["fraud_count"], model="additive", period=7
)  # Weekly pattern

# Plot the decomposition with a larger figure size
fig = decomposition.plot()
fig.set_size_inches(15, 10)
Trend (Second Plot): Captures the underlying direction of the data after smoothing. There are periods where the fraud count increases and periods where it decreases, but they are hard to characterize without additional analysis.
Seasonal (Third Plot): The seasonal component captures repeating short-term patterns. A clear pattern is visible, with consistent ups and downs that repeat approximately every 7 days (weekly pattern). Fraudulent transactions have a weekly seasonality, indicating a recurring pattern across specific days of the week.
Residual (Fourth Plot): The residual component represents the noise or irregularities in the data after removing the trend and seasonality. The points are scattered around zero without any clear pattern, suggesting that most of the systematic structure (trend and seasonality) has been successfully captured. A few outliers are visible, representing unexpected spikes in fraudulent transactions.
The tall spike at frequency 0 indicates a strong trend: the series drifts up or down over time rather than merely repeating. There are no pronounced spikes at other frequencies, so no clear cycle or seasonality stands out, and the rest of the spectrum looks like random noise.
In summary, the long-term trend dominates the data. Weekly patterns exist but are weak, adding small cycles on top of the larger trend, and the substantial noise component may obscure clearer seasonal patterns.
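The frequency-domain plots referenced here are not reproduced in the text, so a naive DFT makes the reasoning concrete. A hedged sketch on synthetic data (a linear trend plus a period-7 cycle; the numbers are illustrative, not the actual fraud counts):

```python
import cmath
import math

def power_spectrum(x: list[float]) -> list[float]:
    """Naive DFT power spectrum at integer frequencies 0..n//2 (O(n^2),
    fine for a series of a few hundred points)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]

# 24 weeks of synthetic daily counts: a linear trend plus a period-7 cycle
n = 168
series = [0.5 * t + 3.0 * math.sin(2 * math.pi * t / 7) for t in range(n)]
spec = power_spectrum(series)
# The trend concentrates power at frequency 0; the weekly cycle peaks at k = n / 7 = 24
```

Exactly as in the fraud series, the trend's spike at frequency 0 dwarfs the seasonal peak, which is why detrending (next section) is needed before seasonal structure becomes visible in the spectrum.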
Detrending Time Series Data for Stationarity Analysis
Code
fraud_count_over_time["rolling_mean"] = (
    fraud_count_over_time["fraud_count"].rolling(window=12).mean()
)

# Visualize the trend
fig, ax = plt.subplots(figsize=(12, 8))
plt.plot(fraud_count_over_time["fraud_count"], label="Original Data")
plt.plot(
    fraud_count_over_time["rolling_mean"], label="Rolling Mean (Trend)", color="red"
)
plt.title("Trend Analysis")
plt.xlabel("Date")
plt.ylabel("Number of Fraudulent Transactions")
plt.legend()
plt.show()
Code
# Subtract the rolling mean
fraud_count_over_time["detrended"] = (
    fraud_count_over_time["fraud_count"] - fraud_count_over_time["rolling_mean"]
)

fig, ax = plt.subplots(figsize=(12, 8))
plt.plot(fraud_count_over_time["detrended"], label="Detrended Data")
plt.legend()
plt.title("Detrended Data")
plt.xlabel("Date")
plt.ylabel("Number of Fraudulent Transactions")
plt.show()
The peak is gone, and the frequencies are now spread out, suggesting that the long-term trend was successfully removed, and what’s left is likely a combination of seasonal patterns, noise, or higher-frequency components.
If there were strong seasonal components (e.g., weekly or monthly cycles), we’d expect visible peaks at specific non-zero frequencies. In this plot, no clear periodic frequency stands out, which suggests seasonality might not be very strong or seasonality is complex and not well captured in the frequency domain.
Residual Analysis
Code
# Ljung-Box test
ljung_box_result = acorr_ljungbox(
    decomposition.resid.dropna(), lags=[10], return_df=True
)
print("Ljung-Box Test Result:", ljung_box_result)
Ljung-Box Test Result: lb_stat lb_pvalue
10 65.351501 3.471721e-10
We reject the null hypothesis that the residuals are white noise: there exists some autocorrelation in the residuals.
# Decompose (use 'additive' or 'multiplicative' based on data behavior)
decomposition = seasonal_decompose(
    fraud_count_over_time["detrended"].dropna(), model="additive", period=7
)  # Weekly pattern

# Plot the decomposition with a larger figure size
fig = decomposition.plot()
fig.set_size_inches(15, 10)
In the seasonal component, a strong, repeating seasonal pattern is still observed.
Fraudulent Transaction Forecasting with ARIMA
To understand the time series better, an ARIMA or SARIMA model can be fitted and assessed. We fit a model that could give more insight into whether the weekly pattern is significant.
Code
# Split the data into train and test sets for the model - 7 days test data
train_data = fraud_count_over_time["detrended"].dropna()[:-7]
test_data = fraud_count_over_time["detrended"].dropna()[-7:]

# Ensure train_data and test_data have a proper datetime index
train_data.index = pd.to_datetime(train_data.index)
test_data.index = pd.to_datetime(test_data.index)

# Suppress specific warnings
warnings.filterwarnings("ignore", category=UserWarning, module="statsmodels")
warnings.filterwarnings("ignore", category=FutureWarning, module="statsmodels")

sarima_model = SARIMAX(train_data, order=(1, 0, 1), seasonal_order=(1, 0, 0, 7))
sarima_fit = sarima_model.fit()

# Forecast for the next 7 days
sarima_forecast = sarima_fit.get_forecast(steps=7)
sarima_forecast_index = pd.date_range(start=test_data.index[0], periods=7, freq="D")
sarima_forecast_series = sarima_forecast.predicted_mean
sarima_forecast_series.index = sarima_forecast_index
sarima_conf_int = sarima_forecast.conf_int()
sarima_conf_int.columns = ["lower", "upper"]
sarima_conf_int.index = sarima_forecast_index

# Ljung-Box test
ljung_box_result = acorr_ljungbox(sarima_fit.resid, lags=[10], return_df=True)
print("Ljung-Box Test Result:", ljung_box_result)

# Model summary
sarima_summary = sarima_fit.summary()
print(sarima_summary)

sarima_in_sample = sarima_fit.predict(
    start=train_data.index[0], end=train_data.index[-1]
)

# Ensure the indices are datetime
sarima_in_sample.index = pd.to_datetime(sarima_in_sample.index)
sarima_forecast_series.index = pd.to_datetime(sarima_forecast_series.index)

plt.figure(figsize=(12, 8))
# Ensure the index is in datetime format
fraud_count_over_time.index = pd.to_datetime(fraud_count_over_time.index)
plt.plot(
    fraud_count_over_time["detrended"].dropna(),
    label="Actual Number of Frauds",
    color="blue",
)
plt.plot(sarima_in_sample, label="Fitted", color="orange")
plt.plot(sarima_forecast_series, label="Forecast", color="green")
plt.fill_between(
    sarima_conf_int.index,
    sarima_conf_int["lower"],
    sarima_conf_int["upper"],
    color="green",
    alpha=0.3,
)
plt.title("Fraud Count Forecast (SARIMA)")
plt.xlabel("Date")
plt.ylabel("Number of Frauds Committed")
plt.legend()
plt.show()
Ljung-Box Test Result: lb_stat lb_pvalue
10 11.0178 0.356133
SARIMAX Results
==========================================================================================
Dep. Variable: detrended No. Observations: 156
Model: SARIMAX(1, 0, 1)x(1, 0, [], 7) Log Likelihood -517.192
Date: Sun, 12 Jan 2025 AIC 1042.385
Time: 17:49:17 BIC 1054.584
Sample: 0 HQIC 1047.340
- 156
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.0349 0.150 -0.233 0.816 -0.328 0.258
ma.L1 0.5441 0.130 4.170 0.000 0.288 0.800
ar.S.L7 0.1188 0.084 1.416 0.157 -0.046 0.283
sigma2 44.2618 4.311 10.268 0.000 35.813 52.711
===================================================================================
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 12.38
Prob(Q): 0.98 Prob(JB): 0.00
Heteroskedasticity (H): 1.03 Skew: 0.50
Prob(H) (two-sided): 0.92 Kurtosis: 3.95
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Model Coefficients
MA(1) is significant, indicating short-term memory in the time series.
AR(1) and Seasonal AR(7) are not significant, suggesting these terms might not be contributing much to the model.
sigma² (variance of residuals) is highly significant, indicating noise in the residuals.
Diagnostics Tests
Ljung-Box Test: The high p-value indicates that the residuals are uncorrelated, suggesting the model has successfully captured the autocorrelations in the data.
Jarque-Bera Test: Residuals are not normally distributed. This may impact confidence intervals and hypothesis testing.
Heteroskedasticity (H Test): No evidence of heteroskedasticity; residuals have constant variance over time.
Final Thoughts
Seasonal effects are less clear, and the AR terms seem unnecessary.
Residuals are uncorrelated and homoscedastic, but not normally distributed.
We could simplify the model by removing the non-significant terms, but with only about six months of data, subtler patterns may simply be undetectable. We therefore close the topic with the conclusion that weekly seasonality is not significant, and that other influencing factors may exist that the present dataset cannot reveal.
Findings
Fraudulent Transactions by Hour of the Day
Fraudulent transactions frequently occur during late-night hours, especially between 10:00 PM and 4:00 AM. This pattern might be attributed to reduced monitoring and increased vulnerability during nighttime.
Fraudulent Transactions by Day of the Week
Fraud is evenly distributed across the week, with a slight peak on Sundays.
Correlation Between Transaction Frequency and Fraud Incidence Over Time
No significant correlation found.
Seasonal Patterns in Daily Fraudulent Transaction Frequency
We analyze the time series by checking for stationarity, decomposing the series, and examining the frequency domain analysis plot. Below are the key conclusions:
Stationarity: The process is stationary, as confirmed by statistical tests.
Trend Removal: The overall trend dominates the data and is successfully removed. The series remains stationary after detrending.
Frequency Domain Analysis: No clear periodic frequency stands out, suggesting that seasonality might either be weak or complex and not well captured in the frequency domain.
Series Decomposition: Despite the unclear frequency domain results, a weekly pattern is still visible in the series decomposition.
SARIMA model is fitted to give a final conclusion:
Weekly seasonality is not significant for modeling the time series of fraud counts.
There may be other factors influencing the data that cannot be identified from the present dataset.
These findings suggest that while some seasonal patterns exist, they are not strong enough to meaningfully improve the model’s performance.
The analysis ultimately confirmed the initial visual insight (Fraudulent Transactions by Day of the Week), but the process added confidence and statistical validation to the conclusion.
Clustering
We investigate whether cluster analysis can be used to identify distinct groups of transactions based on their characteristics and whether we can separate fraudulent transactions from legitimate ones by clustering.
We set off by visualizing our prepared features in a two-dimensional space using UMAP (Uniform Manifold Approximation and Projection). UMAP is a dimensionality reduction technique that preserves both local and global structure, making it ideal for visualizing high-dimensional data in a lower-dimensional space. By projecting the data onto two dimensions, we can observe the relationships between transactions and identify potential clusters.
Code
n = 10_000
feature_projections_df = (
    attach_projections(
        normalized_features_df.limit(n).select(
            vector=pl.concat_list("*"),
            is_fraud=df.limit(n).get_column("is_fraud").cast(pl.Boolean),
            **{
                col: features_df.limit(n).get_column(col)
                for col in features_df.columns
            },
        )
    )
    .sort("is_fraud")
    .select(
        "vector",
        "projection",
        pl.exclude("vector", "projection", "is_fraud"),
        "is_fraud",
    )
)
feature_projections_df.head()
shape: (5, 15); first five rows shown transposed, one column (with its dtype) per line:

vector (list[f64]): [12.0, 7.0, … 0.0] in every row
projection (list[f32]): [-4.147266, -10.594178] | [16.207123, 2.096534] | [3.748188, 6.139663] | [-7.032322, -8.113666] | [10.784208, -1.481717]
tx_hour (i8): 12 | 12 | 12 | 12 | 12
tx_day_of_week (i8): 7 | 7 | 7 | 7 | 7
tx_is_weekend (bool): true | true | true | true | true
tx_category (str): "personal_care" | "personal_care" | "health_fitness" | "misc_pos" | "travel"
distance_from_merch (f64): 24613.746071 | 104834.043428 | 59204.795631 | 27615.117073 | 104423.174625
city_pop_cat (str): "city" | "hamlet" | "city" | "city" | "village"
gender (bool): false | true | true | false | false
age (u16): 52 | 30 | 50 | 33 | 65
job_group (str): "Engineering" | "Creative" | "Education" | "Creative" | "Creative"
amt (f64): 2.86 | 29.84 | 41.28 | 60.05 | 3.19
amt_deviation (f64): 3.695 | 9.14 | 17.34 | 0.0 | 0.0
is_frequently_visited_merchant (bool): false | false | false | false | false
is_fraud (bool): false | false | false | false | false
The visualization below demonstrates that transactions are organized into non-spherical clusters, suggesting they can be grouped effectively. However, it also becomes evident that in the existing feature space, distinguishing between fraudulent and legitimate transactions is challenging. The orange data points, representing fraudulent transactions, are dispersed throughout the plot, indicating the difficulty in separating them from legitimate ones.
We explored various modifications to the feature space and applied clustering algorithms such as KMeans, DBSCAN, and agglomerative hierarchical clustering. Despite these efforts, we were unable to determine that any of these methods effectively distinguish between fraudulent and legitimate transactions.
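For illustration, the three algorithms can be run with scikit-learn as below. The data is synthetic, and the hyperparameters are placeholder choices rather than the values we tuned.

```python
# Running KMeans, DBSCAN, and agglomerative clustering on stand-in data.
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = StandardScaler().fit_transform(rng.normal(size=(500, 12)))

kmeans_labels = KMeans(n_clusters=8, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(X)  # -1 marks noise
agglo_labels = AgglomerativeClustering(n_clusters=8).fit_predict(X)

# Inspect how many groups each method found; in our experiments none of the
# resulting partitions aligned with the fraud labels.
print(len(set(kmeans_labels)), len(set(dbscan_labels)), len(set(agglo_labels)))
```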
Therefore, we shifted our attention to supervised learning methods, which are better suited for classification tasks and can leverage the labeled data to identify patterns that separate fraudulent from legitimate transactions.
Classifier Training & Selection
In what follows, we explore how to train and select classifiers to distinguish between fraudulent and legitimate transactions, a task made complex by the rare occurrence of fraud in credit card data.
To address class imbalance, we apply SMOTE for oversampling the minority class and NearMiss for undersampling the majority class, resulting in three datasets: original, SMOTE-oversampled, and NearMiss-undersampled. These techniques help our models better identify fraud without bias toward the more frequent legitimate transactions.
We train decision tree and random forest classifiers, which excel in this task. Decision trees are interpretable and capture nonlinear patterns, while random forests, as ensembles of trees, enhance accuracy and robustness by mitigating overfitting.
We evaluate each model using classification reports and confusion matrices, focusing on accuracy, precision, recall, and F1-score:
Accuracy indicates overall correctness but is less reliable in imbalanced data.
Precision shows the correctness of fraud predictions, minimizing false alarms.
Recall shows how many actual fraud cases are detected, guarding against missed fraud.
F1-score balances precision and recall, key for catching fraud while minimizing errors.
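A tiny worked example of these metrics; the labels below are made up for illustration.

```python
# Precision, recall, and F1 on a hand-made set of predictions.
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]  # 1 false alarm, 2 missed frauds

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2/4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two, 4/7
print(confusion_matrix(y_true, y_pred))
```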
We also analyze feature importance to understand which attributes contribute most to fraud detection.
Code
X = normalized_features_df.to_numpy()
y = df.select("is_fraud").to_numpy().squeeze()
Code
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
)
for result in results:
    display_model_evaluation(result)
Decision Tree Trained on Original Samples
Random Forest Trained on Original Samples
Decision Tree Trained on Samples Oversampled with SMOTE
Random Forest Trained on Samples Oversampled with SMOTE
Decision Tree Trained on Samples Undersampled with NearMiss
Random Forest Trained on Samples Undersampled with NearMiss
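The training step that produces `results` above is not shown; here is a minimal sketch with scikit-learn on synthetic imbalanced data. The helper and variable names in the real notebook differ.

```python
# Training a decision tree and a random forest on an imbalanced toy dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Roughly 2% positives, mimicking the rarity of fraud.
X, y = make_classification(n_samples=2000, weights=[0.98], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test), zero_division=0))
```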
Model Discussion
Decision Tree on Original Samples
  Class-wise performance: High accuracy for legitimate transactions, lower for fraud
  Confusion matrix: Struggles with true positives and false negatives
  Overall metrics: High accuracy, but imbalance issues are evident
  Feature importance: Transaction amount, category, time

Random Forest on Original Samples
  Class-wise performance: Slightly better precision, lower recall for fraud
  Confusion matrix: Similar difficulty in identifying all fraud cases
  Overall metrics: High macro precision, lower macro recall
  Feature importance: Transaction amount, time

Decision Tree with SMOTE Oversampling
  Class-wise performance: Precision decreases, recall similar for fraud
  Confusion matrix: Better balance, but still misses fraud cases
  Overall metrics: Slightly lower accuracy, an effect of oversampling
  Feature importance: Transaction amount, frequent merchant visits

Random Forest with SMOTE Oversampling
  Class-wise performance: Precision improves, recall decreases for fraud
  Confusion matrix: True-positive prediction issues for fraud
  Overall metrics: Good macro precision, weaker recall; a trade-off
  Feature importance: Transaction amount, gender, merchant frequency

Decision Tree with NearMiss Undersampling
  Class-wise performance: High recall for fraud, very low precision
  Confusion matrix: Many false positives; lacks precision
  Overall metrics: Accuracy drops significantly; recall-focused
  Feature importance: Transaction amount

Random Forest with NearMiss Undersampling
  Class-wise performance: Balanced recall, low precision for fraud
  Confusion matrix: Similar false-positive challenges
  Overall metrics: Better recall and accuracy, limited precision
  Feature importance: Transaction amount, time-related factors
Model Selection
In conclusion, random forests generally offer better precision, while decision trees are quicker and simpler, making them a viable option depending on the specific needs of the task. To handle class imbalance, SMOTE is effective in enhancing recall by minimizing false negatives, whereas NearMiss provides a balance by addressing both types of errors. The transaction amount consistently emerges as a crucial feature in fraud detection, underscoring its importance.
While false alarms can be annoying to customers, missed fraud cases can lead to significant financial losses. Therefore, a classifier trained on the undersampled dataset with NearMiss might be a better choice, as it identifies a much higher proportion of fraudulent transactions, albeit with more false positives. This trade-off is crucial for credit card companies to consider when selecting a model for fraud detection.